The phased Solanum okadae genome and Petota pangenome analysis of 23 other potato wild relatives and hybrids

Potato is an important crop in the genus Solanum section Petota. Potatoes are susceptible to multiple abiotic and biotic stresses and have undergone constant improvement through breeding programs worldwide. Introgression of wild relatives from section Petota with potato is used as a strategy to enhance the diversity of potato germplasm. The current dataset contributes a phased genome assembly for diploid S. okadae, and short read sequences and de novo assemblies for the genomes of 16 additional wild diploid species in section Petota that were noted for stress resistance and were of interest to potato breeders. Genome sequence data for three additional genomes representing polyploid hybrids with cultivated potato, and an additional genome from non-tuberizing S. etuberosum, which is outside of section Petota, were also included. High quality short reads assemblies were achieved with genome sizes ranging from 575 to 795 Mbp and annotations were performed utilizing transcriptome sequence data. Genomes were compared for presence/absence of genes and phylogenetic analyses were carried out using plastome and nuclear sequences.


Background & Summary
Potato is currently the third most important crop for human consumption with increasing production in the developing world and it is an important sustainable global food security crop for climate-smart agriculture 1,2 .Breeding for improved nutritious potato varieties suitable for sustainable production and climate change resilience will be aided by introgression of diverse genes from native Andean cultivars and landraces along with wild Solanum relatives of potato 3 .The genebank at the International Potato Center (CIP) and other genebanks globally, carry diverse germplasm used by breeding programs for genetic improvement of the crop 4,5 .The CIP genebank has potato cultivars, landraces, and wild Solanum germplasm collected from a wide range of environments in North and South America with variation in abiotic and biotic stress.Native Andean cultivars and landraces also carry beneficial nutrients and bioactive compounds for human health 6,7 .
Potato and its wild relatives are in the genus Solanum section Petota, which consists of over 100 species that form tubers 21 .Increasing knowledge on the genetic potential of species in section Petota will increase capacity for their use in enhancing potato germplasm.Genomes of several wild species are also available [22][23][24][25] and a Petota super pangenome representing 60 species was recently released 26 .The current data set includes a phased genome assembly of the wild species S. okadae, for which short reads and the mitogenome were previously published 26,27 .In addition, the study contributes Illumina short read sequences (50-100x sequencing depth) and de novo assemblies (3,328-59,223 N50) for genomes of 16 wild species in section Petota that were noted for stress resistance and of interest to potato breeders.An S. etuberosum genome from outside section Petota was also included in this data set, as were three polyploid hybrids.For most of the species in this project, these are the first genome sequences released of their species.Taxonomic groupings of the species of section Petota into Clade 1 + 2, Clade 3 and Clade 4 were represented in the genomes sequenced.
Transcriptome sequence data was generated and used for annotation.Both nuclear and organellar genome sequences are provided.Comparative analysis of the presence/absence variation against the Petota super pangenome provided insight into the diversity of the species in the phylogenetic analysis.

Methods
Solanum okadae (OKA15) genome sequencing.Botanical S. okadae seed sourced from US Genebank NRSP-6 (PI 458367) were sown on agar media.Individual clones were propagated in vitro.Seventeen lines were planted in the greenhouse and self-pollinated.While abundant flowering was noted, a single line, OKA15, was selected for further study 26,27 .The 10X Genomics short reads were previously published 26,27 .High molecular weight DNA for S. okadae OKA15 was prepared using a modified CTAB procedure 28 and used for Hi-C (chromatin interactions), Pacific Biosciences (PacBio) and Nanopore sequencing.The Hi-C library preparation was done using Phase Genomics Proximo Plant Hi-C version 4.0, and sequenced at Génome Québec's Sequencing centre, Montreal, Canada, on Illumina NovaSeq.6000 platform in PE 100 bp mode with ~3B reads, while both the PacBio libraries (sequenced on Sequel II system using the SMRT technology (HiFi) at a depth of 100X,) and the Nanopore ONT libraries (sequenced on the PromethION at a depth of 100X) were prepared and sequenced by Novogene (Novogene Co., Ltd, Beijing, China).and other biotic and abiotic stresses were selected for this study (Table 1).The genomic DNA of these accessions was isolated from in vitro or greenhouse grown plantlets or leaves.The library construction, quality control, and sequencing were done by Novogene (Novogene Co., Ltd, Beijing, China).The genomic DNA was fragmented, ligated with adapters, and PCR amplified to construct single libraries with an insert size of 350 bps.The sequencing was performed on the Illumina NovaSeq.6000 platform, in a paired-end mode (2 X 150 bp) at ~50x or ~100x depth (Table 2).Similarly, total RNA was extracted from these accessions, including from OKA15, from leaf, shoot, flower, and/or tuber tissues.The RNA-Seq library construction used standard protocol (Novogene) and sequencing was done on the Illumina NovaSeq.6000 platform (PE150) (Table 2).S. okadae OKA15 phased genome assembly and quality control.The Nanopore reads were adapter trimmed using porechop v0.2.4 (https://github.com/rrwick/Porechop)and default parameters.hifiasm v0.16.1-r375 29 was used to generate a haplotype-resolved assembly using the PacBio HiFi and the Hi-C reads (Fig. 1).Organellar genomes were removed from both resulting haplotype assemblies.The HiFi and Nanopore reads were used to scaffold the haplotype assemblies.Hi-C reads were trimmed using fastp v0.23.1 30 , and mapped to haplotype assemblies individually using the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and default parameters.The alignments were filtered to keep only the unique alignments (samtools view -q 40) 31 .The scaffolding was performed using YaHS 32 .Hi-C maps were generated using Juicer 33 and manually reviewed using JBAT 33 .Any duplicated contigs from the assemblies were removed using mummer 34 .
Fig. 2 The k-mer coverage plots of each of 23 genome sequences from potato wild relatives (Solanum sp).
Gaps were closed using tgsgapcloser 35 with PacBio and Nanopore reads.A custom repeat library was constructed using Repeatmodeler 36 and repeats were masked using Repeatmasker 37 and default parameters.RNA-Seq reads were filtered using Kraken2 38 and trimmed using fastp 30 , then aligned to the assembly using hisat2 39 .Structural annotations were identified using Braker3 40 , and curated using gfacs 41 , and functional annotations were generated using ahrd (https://github.com/groupschoof/AHRD)and interproscan 42 .
Quality control and analysis of WGS and RNA-Seq for additional potato wild relatives.Quality control was performed on the raw sequencing data (WGS and RNA-Seq) using FastQC v0.11.9 43 .An adaptor removal and read quality trimming was performed on the raw reads using Trimmomatic v0.39 44 .Genome heterozygosity and estimated size were determined using the trimmed WGS reads.First, k-mer frequencies were calculated for each genome (k = 21) and a k-mer histogram was generated using Jellyfish v2.3.0 45 .The k-mer histogram was used to generate various genome characteristics with GenomeScope 2.0 46 .The polyploid genomes of S. brevicaule (=oplocense) X S. tuberosum (B1595106), S. tarnii X S. tuberosum (SH7_18_3), and S. commersonii X S. andigena (cxa70P5) hybrids showed higher rates of heterozygosity (Fig. 2), as seen in previous potato polyploid genomes 20 , while the S. etuberosum (etb0161) genome showed very low levels of heterozygosity.The estimated haploid genome sizes ranged from 575 Mbp in S. pinnatisectum (S7) to 795 Mbp in S. tarnii X S. tuberosum (SH7_18_3).
De novo assembly and annotation.Each genome was assembled into contigs from the raw Illumina reads using the de novo assembler MaSuRca v4.0.5 47 .Organelle sequences were removed from the resulting assemblies using publicly available potato plastome 48,49 and mitogenome sequences 27,50 .The de novo assemblies were aligned against the organellar reference using nucmer from Mummer v4.0.0beta2 utilities 34 and the alignments were filtered for 95% identity and 200 bp alignment length.Any contigs with >95% coverage against the organelles were removed, as well as organellar sequences at the start and ends of contigs.The resulting filtered contigs were queried against the non-redundant nucleotide database using blast+ v2.11.0 51 to remove contaminant sequences.Any contig with a reliable match (90% query converge with 90% sequence identity) to organisms outside of green plants were removed from the assembly.Finally, contigs that are less than 200 bp in length were removed and the remaining contig ids were modified to create a final clean assembly file using BBmap v38.86 52 (Table 3).The plastomes of each genome were assembled and annotated from raw Illumina reads using the Plastaumatic 53 pipeline.The quality of each de novo assembly was calculated using QUAST v5.0.2 54 to determine the assembly lengths, N50 values etc, and BUSCO v5.2.2 55 to check for the completeness of the assembly by looking for the presence of single copy orthologs from Viridiplantae (Fig. 3).An assembly was aligned against its reference genome 22 when available using nucmer from mummer v4.0.0beta2 34 and the alignment statistics were generated using dnadiff.The genome assemblies were annotated using the RNA-Seq data as the evidence for gene prediction along with homology-based prediction.The two accessions (B1595106 and SH7_18_3) with no RNA-Seq data were annotated only with the protein sequences.A repeat masking was performed on each genome assembly using RepeatMasker v4.1.2-pl 37with a custom repeat library of potato reference sequences constructed by concatenating repeat libraries from the Petota super pangenome 26 and DMv6.1 9 .Table 4 details the number of bases masked in each genome and size of different types of transposable elements found in them.Then the structural annotation was performed using BRAKER v2.1.6 40,56,57in two runs, one with the RNA-Seq data as evidence and another run with protein sequences as evidence.The trimmed RNA-Seq reads of each accession were aligned to its genome assembly using HiSAT2 v2.2.1 39 for BRAKER1 57 .Most of the accessions have an optimal RNA-Seq alignment rate (>65%) against their short read de novo assemblies.The alignment files were sorted and indexed using SAMtools v1.16.1 31 .Then the structural annotation was performed using the BRAKER pipeline [58][59][60][61][62][63] on the masked genome assembly with the alignment file.The BRAKER pipeline performed gene prediction by AUGUSTUS v3.4.0 64 and GeneMark-ET 65,66 to generate gene structural annotations.For the homology-based prediction (BRAKER2), Solanales protein sequences from OrthoDB along with the protein sequences from potato reference genomes 9,18,67 were combined to create a protein database for BRAKER.The BRAKER1 and BRAKER2 results were combined to select transcripts based on support by extrinsic evidence using TSEBRA v1.0.3 68 to get a high-confidence gene set.The annotations for the accessions with no RNA-Seq data (B1595106 and SH7_18_3) were obtained from the BRAKER2 run with protein sequences as evidence.Since no RNA-Seq evidence was available for these two, all the predicted structural genes were added to the final gene set.Figure 4 shows the number of genes and transcripts predicted in these accessions by the BRAKER pipeline.A total of 31,247-47,705 high confidence genes (i.e.genes with RNA-Seq evidence) were found in these genomes, with S. etuberosum having the lowest number of genes and the S. commersonii X S. andigena hybrid having the greatest number of genes.A high confidence gene set was not generated for B1595106 and SH7_18_3 genomes due to lack of the RNA-Seq data.A functional annotation was performed using Interproscan v5.52-86 42 against the PFAM database.Approximately 65-70% of the annotated genes in all the genomes were found to have PFAM domains, except for the SH7_18_3, and B1595106 genomes (for which RNA-Seq data was not available).

BUSCO Assessment Results
Fig. 3 BUSCO quality assessment results of short reads de novo genome assemblies for 24 potato wild relatives or hybrid clones (Solanum sp).The bar plot shows the percent of BUSCO genes (Viridiplantae dataset) present in each genome assembly.Species information and accession numbers are detailed in Table 1.
Fig. 4 The number of high-confidence genes and transcripts found in 22 potato wild relatives accessions (Solanum sp).The two columns for each genome represent the total number of predicted genes and transcripts (larger number on top of bar) along with their proportions of functional annotations (smaller number inside bar).Species information and accession numbers are detailed in Table 1.

Technical Validation
Sequencing data quality control.The quality of the raw WGS and RNA-Seq reads were checked using FastQC 43 .Various metrics such as per base sequence quality, overall read quality score, GC content, N content, and overrepresented sequences were analyzed to detect low quality sequences or biases in the data.We have found no issues or bias in the WGS reads, whereas the RNA-Seq data is found to have overrepresented sequences in some of the datasets, which is not unusual for RNA-Seq data.We have also checked for viral genome sequences in the RNA-Seq data since potato plants are frequently found to have various types of potato viruses.The reads were classified using kraken2 v2.1.2 38by searching against their own genome assemblies and potato virus sequences downloaded from NCBI.Most genomes have very few hits to the viral sequences, and we removed reads that have hits to viruses from each sample and used the clean reads in the following analyses.
Quality assessment of the genome assemblies.The de novo genome assembly quality was assessed by QUAST 54 and BUSCO 55 where contiguity and completeness of the assemblies were analyzed.The OKA15 genome assembly haplotype 1 and 2 have a complete BUSCO score of 98.3% and 98.5%, respectively, when compared to the (solanales_odb10).Merqury 259 was used to estimate the completeness (QV).The complete OKA15 assembly was 99.4%, while haplotype 1 and haplotype 2 were 65 and 65% complete individually (meaning 15-30% of the reads map equally well to both haplotypes).Haplotype 1 (hap1) has 214 contigs with a total length of 729 Mb.The largest contig is 84.7 Mb and the N50 is 58.6 Mb.Haplotype 2 (hap2) has 131 contigs with a total length of 726 Mb, the largest contig is 83.2 Mb and the N50 58.4 Mb.The heterozygosity of the oka15 genome was calculated using jellyfish 45 and found to be 0.87%.The S. etuberosum genome, which has been self-pollinated for several generations, has the best assembly of all with an N50 value of 59,223, 34,061 contigs, and 610 Mbp genome size.The three tetraploids (B1595106, SH7_18_3, and cxa70P5), and a diploid S. chacoense (chc0917) have fragmented assemblies (Table 2).Nine genome assemblies from this study were also compared with long-read assemblies of the same species from a recent study 22 .Overall, the amount of sequences that mapped to their individual reference ranged from ~85-99% with ~79-98% of the assembly coverage (Table 7).The BUSCO assessment of the assemblies revealed the majority of them are complete with presence of >80% complete genes (Fig. 2).Overall, the S. etuberosum genome has the highest percent of core plant orthologous genes with 94.1% of genes present as complete copies, followed by the chq762573 and pur1868 genomes with 92% and 89.6% of complete BUSCO genes.The SH7_18_3, cxa70P5, and B1595106 genomes have a higher percentage of fragmented genes among the 23 genomes.High rates of heterozygosity and higher ploidy levels are the major contributing factors to highly fragmented assemblies 38 .

Phylogenetic inference.
Previous studies classified section Petota (tuber-bearing) into major Clades and subgroups 13,26,260,261 .Here we have reconstructed phylogenetic trees from the PAV data (Fig. 5) as well as plastome sequences (Fig. 6) to understand the relationship between these genomes.pangenome 26 .S. okadae placed in the Clade 4 South as also seen in the Petota pangenome.The S. tarnii X S. tuberosum (SH7_18_3) genome, which is a somatic hybrid between S. tarnii (Clade 1 + 2 species) and S. tuberosum (Clade 4) obtained by in vitro fusions of protoplasts, placed differently in the PAV and plastome phylogenetic trees.In the plastome phylogenetic tree, it is grouped with the Clade 1 + 2 species, whereas, in the PAV tree it is grouped with another S. tuberosum hybrid (Clade 4) showing the phylogenetic difficulty with hybrids and the importance of including both nuclear and organellar data.Moreover, genomes that are known to be closely related are grouped together, and genomes representing the same species grouped together, which validates the use of PAVs in determining phylogenetic relationships 26 .

Usage Notes
The Solanum section Petota has over 100 cultivated and wild species.The potato crop wild relatives are an excellent source of genetic variation; however, not many sequencing efforts have been carried out to study these wild species.Here, we present the 24 phased chromosomes for S. okadae, and genome and transcriptome data for an additional 23 potato wild relative accessions.Though most of these are short-read assemblies, they are a  useful tool for exploring genetic diversity that can provide insight in potato pre-breeding.The WGS reads can be used in various analyses such as determining SNPs, structural variations, and PAVs to understand the genetic diversity that exists within the section Petota.The RNA-Seq data of these wild species with important agronomic traits is especially crucial to understand their transcriptome profile.S. okadae foliage has been shown to contain the glycoalkaloid tomatine, and undetectable levels of solanine and chaconine, and carries Colorado potato beetle resistance, drought and cold tolerance (but not frost) and moderate late blight resistance [262][263][264][265] .

4 S 2 S 1 S 6 S 2 S 28 S 7 S _ c o m m e rs o n ii_ S 3 S_lignicau 4 S 9 S5 8 4 7 SFig. 5 A
Fig. 5 A whole genome presence-absence variation (PAV) based phylogenetic tree showing the relatedness of potato wild relatives (Solanum sp).The S. etuberosum genome sequence was used as an outgroup.The numbers at the nodes represent bootstrap support values.For clarity of the figure, only key values are given, and the ones omitted are all 100.

1 SFig. 6 A
Fig.6A plastome sequence based phylogenetic tree of 24 potato wild relatives (Solanum sp).The S. etuberosum was used as an outgroup.The numbers at the nodes represent bootstrap support values.

Table 1 .
Genome and transcriptome sequencing.A total of 23 additional accessions representing potato wild relatives or hybrids with agronomically important traits such as disease resistance, cold and drought tolerance, A list of the 24 potato wild relatives (Solanum sp) accessions selected for this study with their detailed descriptions.

Table 2 .
Details of the tissues used and amount of data generated for WGS and RNA-Seq.The raw data is calculated as: number of reads*sequence length (150 bp).

Table 3 .
De novo genome assembly statistics and the heterozygosity information.

Table 4 .
7171with GTR2 + FO + R4 as a substitution model and 1000 bootstrap replicates and S. etuberosum set as an outgroup Results of the repeat analysis with different classes of TEs and their size.
The results have shown similar groupings as previous studies where S. pinnatisectum (Clade 1 + 2), S. paucissectum, S. piurae, and S. chiquidenum (Clade 3), and the remaining accessions (Clade 4) separated into different clades.Within Clade 4, the S. megistacrolobum accessions formed a sister clade to Clade 4 South, something which was also seen in the Petota

Table 5 .
A detailed list of genomic data descriptor records.It includes BioSample, SRA, genome, and plastome assembly accession numbers of each genome.

Table 6 .
A detailed list of transcriptomic data descriptor records.It includes BioSample and their respective SRA accession numbers of individual samples.

Table 7 .
Alignment statistics of nine genomes from this study compared against reference genomes.